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Abstract 

The complexity of linear mixed-effects (LME) models means that traditional diag¬ 
nostics are rendered less effective. This is due to a breakdown of asymptotic results, 
boundary issues, and visible patterns in residual plots that are introduced by the 
model fitting process. Some of these issues are well known and adjustments have 
been proposed. Working with LME models typically requires that the analyst keeps 
track of all the special circumstances that may arise. In this paper we illustrate a 
simpler but generally applicable approach to diagnosing LME models. We explain 
how to use new visual inference methods for these purposes. The approach provides a 
unified framework for diagnosing LME fits and for model selection. We illustrate the 
use of this approach on several commonly available data sets. A large-scale Amazon 
Turk study was used to validate the methods. R code is provided for the analyses. 
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1 Introduction 


Model checking is an essential step of statistical modeling that ensures the assumptions 
necessary for valid inference are upheld. This process includes both the search for con¬ 
tradictions of model assumptions and an assessment of how well the model captures the 
characteristics of the data. Such investigation can be carried out using test statistics and 
p-values to gauge the strength of evidence, however, such methods indicate only the degree 
to which there is a problem with the model. Graphical diagnostics enable the analyst to 
detect not only when there is a problem with the model and where it occurs, but also give 
some indication of what may be the cause of the problem. For complex models, the insight 
provided by graphical diagnostics is especially useful, allowing the analyst to develop in¬ 
tuition and make discoveries about the nature of model violations. This would be nearly 
impossible through purely numeric methods. 

While graphical diagnostics develop intuition and make discoveries, it is easy to mis¬ 
interpret a single plot, especially when dealing with more complex models such as linear 
mixed effects (LME) models. For example. Figure [^displays four residual plots used to di¬ 
agnose LMEs: (left) a normal Q-Q plot of the random slopes; (center) box plots displaying 
the error terms by group; and (right) a plot of the residuals vs. a predictor. Which of the 
plots in Figure [T] would you consider indicative of a model violation? 



Figure 1: Four residual plots from different LME models. Which ones, or all, of these 
would you consider to indicate structure not captured by the model? 

Based on our discussions with other statisticians, we have found that all four plots could 
be deemed indicative of a violation; however, only the plot of the residuals vs. the predictor 
is truly problematic. Figure [T] demonstrates that using only a single plot to distinguish 
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between structure introduced by deviations from the model and structure introduced by 
the model htting procedure is difficult, especially for more complicated statistical models 
such as LMEs. The purpose of this paper is to illustrate how visual inference allows us to 
distinguish between structure due to deviations from the model and structure introduced 
by the model htting procedure. Additionally, the proposed graphical tests can be used in 
situations where the assumptions of conventional inferential procedures are violated. While 
we focus only on LMEs, the discussion is general enough to illustrate how visual inference 
can be used to overcome similar issues in other model classes. 

The remainder of this paper is organized as follows. Section [^provides an introduction 
to the framework of visual inference and the lineup protocol. Section [^introduces problems 
encountered in model selection and model checking, which are expanded upon in Sections [^ 
andj^ respectively, and presents solutions utilizing visual inference. Throughout Sections [^ 
and [^multiple examples are used, and comparisons between conclusions on model building 
and ht from the new visual inference approach and existing diagnostic tools are provided. 
All example data sets are readily available for public use. A short description of each data 
set can be found in Section of the supplement. 


2 Visual inference 


Classical statistical inference consists of (i) formulating null and alternative hypotheses, 
(ii) calculating a test statistic from the observed data, (iii) comparing the test statistic 
to a reference (null) distribution, and (iv) deriving a p-value on which a conclusion is 


based. Each of these steps has a direct analog in visual inference, as outlined by Buja 


et al. (2009). This section highlights the parallels between conventional hypothesis tests 


and visual inference in the setting of LME models. 

Assume that the question of interest involves some assumption about a model (such 
as a null hypothesis of homogeneity of residual variance) while the alternative hypothesis 
encompasses any violation of this model assumption. For visual inference, the test statis¬ 
tic corresponds to a plot that displays an aspect of the model assumption and allows the 
observer to distinguish between scenarios under the null hypothesis from scenarios under 
alternative hypotheses. Plots drawn from data generated consistently with the null hy- 
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pothesis are called null plots. The set of all null plots constitutes the reference distribution; 
thus, the plot of the observed data is indistinguishable from the null plots if the model 
assumption holds. In the lineup protocol, the plot of the observed data is randomly em¬ 
bedded among a sample of, usually 19, null plots drawn from the reference distribution. 
These lineups are then presented to independent observers for evaluation. 

Evaluation by independent observers allows for the estimation of a p-value associated 
with the lineup: Let X be the random variable describing the number of observers, out of 
N, identifying the data plot. If X = a; is the number of observers who chose the data plot 
from the lineup, then the p-value is the probability that at least x observers chose the data 
plot, given that the null hypothesis is true (i.e., the data plot is not any different from the 
other plots in the lineup). Under the null hypothesis the probability of choosing the true 
plot is 1/m (for a lineup of size m), and X is distributed according to a distribution similar 
to a Binomial distribution i?Ar,i/m, but adjusted for the dependencies between plots in a 


given lineup. (Majumder et ah, 2013, introduced visual p-values. Details of the calculation 


for this LME model application are given in Section C.l of the supplement.) 

Figure [^shows an example lineup. Each panel presents line segments of different lengths 
with varying slopes. Observers were asked the question ‘Which plot is the most different?’ 
Any information revealing the context of the data, such as axis labels, units, titles, and 


legends, was carefully removed to avoid subjective bias (Meilgaard et ah, 2006), ensuring 
that observers made decisions based purely on the data display. 

Based on a human subject study (described in more detail in Section of the sup¬ 
plement) run through the Amazon MTurk service (Amazon.com, Inc, 2015[ ), 11 out of 73 
observers chose the data plot, shown in panel #(vT44 -|- 4)[^ of the lineup in Figure 
resulting in a visual p-value of 0.0171. This leads us to reject the null hypothesis that the 
data plot is consistent with the model that generated the data, on which all of the null 
plots are based. The model and its corresponding null hypothesis are explained in detail 
in Section m 

Unlike classical hypothesis tests, visual inference allows us to collect additional infor¬ 
mation on what aspect of the display led each observer to their choice. This information 

^We encode the panel number as a mathematical expression to pose a cognitive obstacle, allowing the 
reader to evaluate the lineup before being biased by knowing the answer. 
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Figure 2: One of the lineups used in an Amazon Turk study to test for the necessity of 
random slopes in an LME model. Participants were asked to identify the most ‘different’ 
plot and give a reason for their choice. Of the 68 observers, 10 identihed the data plot as 
the most different. This indicates the data are recognizably different from the null plots, 
indicating the need to include a random slopes component in the model. Further details 


on data, model and the location of the data plot are given in Section 4.2 


makes it possible to assess which part of the null hypothesis is violated, something not 
feasible in classical hypothesis tests. For example, ‘Spread’ as the reason for an observer’s 
choice (over ‘Outlier’, ‘Trend’, ‘Asymmetry’, or ‘Other’) in Figurej^was associated with the 
highest probability of picking the data plot over a null plot (see Table [^of the supplement). 
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3 Issues with LME Models 


In this paper we consider the two-level continuous-response linear mixed-effects model 
(LME) with uncorrelated errors fit either by maximum likelihood or restricted maximum 
likelihood. More specifically, 


Vi — Xi l3 -(- Zi bi -\- Si 

{rtiXl) (riiXp) (pxl) (riiXq) (qxl) (riiXl) 


( 1 ) 


where there are i = 1,... ,g non-overlapping groups, Xi and Zi are design matrices for 
the hxed and random effects, respectively, and /3 denotes the hxed effects, bi denotes the 
random effects for group i, which are assumed to follow a multivariate normal distribution 
and are independent between groups. denotes the error terms, which are assumed to be 
i.i.d. normal random variables. 

Similar to simple linear models, residuals form the diagnostic core of a LME model. 
But, LME model residual analysis is complicated by the fact that there are numerous 
quantities that can be defined as residuals, with each residual quantity being associated 
with different aspects of the model. The two fundamental residuals for model checking 
considered here are: (1) the error terms (i.e. level-1 residuals), and (2) the predicted random 
effects (i.e. level-2 residuals). 

The use of random effects models comes at the cost of complicating model exploration 
and validation. Problems addressed in this paper include: 

1. Model selection: 

(a) Assessment of asymptotic distributions-. Test statistics used for model selection 
and validation rely on asymptotic reference distributions which often perform 
poorly in finite sample situations. For example, the unconditional empirical 
distribution of the predicted random effects does not resemble the theoretical 


distribution unless strong assumptions are met (Jiang, 1998, Theorem 3.2 and 
Lemma 3.1). In many finite sample situations these assumptions do not hold, 
making conventional use of Q-Q plots and tests of the empirical distribution 


function ineffective for distributional assessment (see the supplement of Loy and 


Hofmann, 2015, for supporting simulation results). 


(b) Boundary issues in the comparison of nested models: When evaluating the sig- 
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nificance of terms in the random effects structure, the likelihood ratio test statis¬ 
tic does not have the usual reference distribution if we are testing whether a 
variance component lies on the boundary of the parameter space. This results 
in tests for the random effects that tend to be conservative. 

2. Model checking: As seen in Figure residual plots might display noticeable patterns 
that are artifacts of both the data structure and the model estimation procedure 
rather than indications of lack of ht. This problem is especially pronounced for plots 
of the error terms, as they often exhibit patterns that appear to be indicative of 
heteroscedasticity, but are merely consequences of data imbalances or sparsity. 

The above issues are well-known, and in special circumstances adjustments to the 


methodology have been proposed. For example, Stram and Lee (1994) suggest using a 
50:50 mixture of Xq and Xq+i when testing q versus q + 1 random effects; however, this ad¬ 


justment is not successful in all cases. Lange and Ryan (1989) suggest using weighted Q-Q 
plots to assess the distributional assumptions made on the random effects. This approach 
is effective in settings where the residual variance, cr^, is small relative to the variance 


components for the random effects, but breaks down when this is not the case (Loy and 


Hofmann, 2015). 


In the following sections we expand on the problems encountered in model selection and 
model checking, respectively, and present solutions utilizing visual inference. 


4 Model selection 

Model selection for linear mixed-effects models relies on the comparison of nested models 
for the selection of both the hxed and random components. It is standard practice to use 
a t-test, F-test, or likelihood ratio test to determine whether a hxed effect describes a 
signihcant portion of the unexplained variability. Alternatively, likelihood-based criteria, 
such as AIC or BIC, are often used to overcome the problem of being able to only deal with 
nested models. When selecting random effects, likelihood ratio tests are most commonly 
used. However, situations often arise that complicate such tests. Some of these situations 
are outlined below: 

Fixed effects: Likelihood ratio tests based on REML estimation cannot be used to test 
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different fixed effects structures. Maximum likelihood estimation allows for such 
comparisons, but is anti-conservative. Dehning the appropriate degrees of freedom 
for t- or F-tests provides another complication in testing scenarios of LME models. 
Various approximate F-tests propose solutions for estimating the degrees of freedom 
for these tests, but these typically lead to different results ( Verbeke and Molenberghs] 
2000|. Inflated Type I error rates and low power become a problem in the approximate 


F-test when there are a few groups (Catellier and Muller, 2000). Kenward and Roger 


(1997) propose the use of a scaled Wald statistic that has a sampling distribution that 
is well approximated by an F distribution; however, this approach has inflated type 


I error rates in some small sample cases (Gomez et ah, 2005). Skene and Kenward 


(2010) propose an alternative to the Kenward-Roger approximation that achieves 
nominal type I error rates, but this procedure suffers from low power. 

Random components: When testing for the inclusion of a variance component, the 
parameter being tested lies on the boundary of the parameter space, and the asymp¬ 
totic distribution of the likelihood ratio is no longer Various approaches to deal 


with this problem have been suggested in the literature. Demidenko (2013, Section 
3.5) presents an exact F-test of the null hypothesis D = 0. Unfortunately, the 
all-or-nothing inclusion of random effects limits the use of this test in practice. Ap¬ 


proximations have been suggested and shown to be useful in many situations (Stram 


and Lee, 1994 Morrell, 1998), but no single approximation holds for all situations. 


Alternatively, the rule of thumb suggested by Stram and Lee can be utilized with the 
knowledge that the results may be sub-optimal. This leads to a need for simulation 
studies to determine the proper adjustment to the reference distribution in every 
situation. 

Visual inference provides an alternative to conventional hypothesis tests that does not 
require different rules based on the method of estimation or location of a parameter in the 
parameter space, avoids the tricky business of dehning degrees of freedom, and allows for 
the testing of subsets of the random effects. Rather, visual inference depends on the choice 
of an appropriate plot highlighting the aspect of the model in question, the number of null 
plots, and the number of independent observers. 
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4.1 Fixed effects 


To test the significance of a fixed effect, we snggest using a plot comparing a residual 
quantity from the model without the variable of interest with the values of that variable. 
The residual used depends on the level at which the variable of interest enters the model: 
if the variable enters at the observation-level (level-1), then the level-1 residuals are used; if 
the variable enters at the group-level, then both the level-1 and level-2 residuals are explored 
as the variable has the potential to explain additional variation at either level of the model. 
Additionally, the type of plot depends on the variable type—if a continuous variable is 
targeted, a scatterplot with a smoother is suitable for testing; for a discrete covariate, we 
make use of side-by-side box plots. In this setting, the null plots are generated using the 
parametric bootstrap with a model that omits the variable of interest, details of which can 


be found in Section |C.2[ The true plot is constructed from the same model, but is fit to 
the observed data. 

Figure [^illustrates the use of this type of lineup. In our study, 68 observers were asked 
to choose the plot that is the most different from the rest. Sixty observers identified the 
data plot in panel #(2^ -|- 2), with over 90% pointing to the trend as the distinguishing 
feature. This lineup is chosen to determine whether a child’s language development (low, 
medium, or high) at age two (plotted on the x-axis) is associated with the development of 
social skills for children diagnosed with autism spectrum disorder. Displayed on the y-axis 
are level-1 residuals from a longitudinal model, i.e., the grouping in model j^is given by each 
individual. Clearly, language development at age two accounts for a significant amount of 
the remaining residual variability. 


4.2 Random effects 

Tests of the random part of an LME model focus on two questions: (1) whether a marginal 
random effect improves the model and (2) whether allowing the random effects to be corre¬ 
lated improves the model. Different plots must be used to answer each question. To answer 
the first question, we suggest using plots comparing the response and the explanatory vari¬ 
able of interest using appropriate (often linear) smoothers for each group. Scatterplots 
comparing the predicted random effects can be used to answer the second question. 
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Figure 3: Lineup of level-1 residuals in side-by-side box plots to test significance of a 
discrete covariate. Of the 68 participants, 60 identified the data plot from the lineup. This 
indicates that the covariate is necessary for the model. 


The lineup in Figure was chosen to test the relationship between scores from the 
General Certihcate of Secondary Education Exam (GCSEE) and the standardized London 


Reading Test (LRT) (see Section B.l). Each line segment represents one of 65 inner-London 
schools. The slope of each line is determined by a linear regression relating the two test 
scores for each student at a school. The question of interest is whether random slopes 
for LRT scores are required to represent the relationship between GGSEE and LRT scores 
{Hi). Gorrespondingly, data for the null plots were created by simulating GGSEE scores 
from a model with the random intercept as its only random effect. The resulting scores for 
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each school are regressed on LRT scores and model hts are shown as lines. If the model is 
appropriate, then the observed data should resemble the overall pattern of the lines in the 
null plots. In this example, we hnd that the true plot in panel 7 ^(a/ 144 + 4) is identihable: 
11 of the 73 observers picked the data plot, resulting in a visual p-value of 0.0171. The 
main comments participants gave to explain their choice were the spread and trend of 
the line segments in the plot. This is consistent with a larger variance in the slopes than 
the null model allows; thus, we hnd evidence supporting the inclusion of a random slope 
for standardized LRT. This conclusion agrees with the results of the likelihood ratio test 
(which shows signihcance at a level of less than 0.0001), and did not require the use of an 
asymptotic distribution to calculate the p-value. 

Note that participants are unable to identify the true data from a follow-up lineup (Fig¬ 
ure 1^ in which the random slope was included in the model from which the samples shown 
in the null plots were created (using a parametric bootstrap as outlined in Section C.2[ ) 
None of the 64 observers identihed the data plot in panel ^(2 ■ 3^), providing no evidence 
that the covariance structure for the LRT scores is misspecihed. 

Participants should only see one of Figures or because both show the same data 
in the same design. After viewing the hrst of these hgures we cannot, strictly speaking, 
assume that a participant is still an unbiased judge, because, theoretically, the data panel 
from the second lineup could be identihed by recognizing it as the same panel that was 
previously shown. While the chance of this is slim, we only exposed participants to one of 
a set of dependent lineups. 

Having considered the value of a random slope in the model, we next consider whether 
the model needs to allow correlation between the random effects {Hi). While this is an 
example of a standard likelihood ratio test problem—a correlation of zero is not on the 
boundary of the parameter space—using a lineup keeps all tests of the random effects in a 
unihed framework. The lineup in Figure|^shows scatterplots of the predicted random effects 
with overlaid regression lines. The null plots in the lineup are created by simulation from the 
model that does not allow for correlation between the random effects, and the true plot is 
created using the predicted random effects from such a model £t to the observed data. The 
slopes of the regression lines are indicative of the amount of correlation. If the correlation 
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Figure 4: A follow-up to Figure]^ where the random slope has been included in the model 
from which the null plots are created. None of the 64 observers identified the data plot as 
the most different, providing no evidence of model misspecihcation. 


between the random effects is not necessary, then the true plot will display little correlation 
and be indistinguishable from the null plots. The lineup allows us to gauge the amount 
of correlation between the random effects while accounting for the effect of shrinkage in 


the model, avoiding the over-interpretation of structure in such plots discussed by Morrell 


and Braii^ (2000). In Figure the true plot in panel #(10 -|- a/^) was identified by 36 of 


the 63 observers, providing very strong evidence in support of the additional parameter for 
correlation between random effects, which agrees with a p-value of 0.0041 from the classical 
likelihood ratio test. 
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Figure 5: Lineup for testing the correlation between random effects. Predicted random 
intercept values are plotted on x against predictions of the corresponding random slopes on 
y. Of the 63 observers, 36 identihed the data plot from the lineup, lending strong support 
for a non-zero correlation between the random effects. 

5 Model checking 

In the formulation of model Q we make a number of assumptions that must be satished. In 
this section we discuss how residual plots can be used with lineups to check the assumptions 
of homogeneous residual variance, linearity, and normality of the random effects. While 
we only focus on these assumptions, the discussion is general enough to reveal how visual 
inference can be extended to check other aspects of the model. 
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5.1 Homogeneity of residual variance 


Model Q assumes homogeneity of the within-group variance. To check this assumption we 
must verify (a) the homogeneity of the within-group residual variance across the levels of all 
explanatory variables and (b) check that the within-group variance is also constant between 
groups. Such investigations are often carried out using plots of the level-1 residuals. In 
order to guard against mis- or over-interpretation of the residual plots, we, again, employ 
lineups. 


(a) Checking homogeneity of level-1 residual variance across covariates: Residual 
plots are one of the most-often used tools for checking the relationship between residuals 
and one of the model’s covariates. In Figure]^ we employ a lineup of scatterplots plotting 


the level-1 residuals against pressure in the dialyzer study (see Section B.4). For the null 
plots we derive residuals from model-rehts of properly specihed parametric bootstraps of 
the investigated model. The plot of the observed data is inserted in position ^(2^-|-3). Out 
of 80 evaluations by independent observers the data plot was identihed 26 times, providing 
evidence of heteroscedasticity. 


(b) Checking homogeneity of level-1 residual variance between groups: If a 

covariate is discrete, we have the choice between different ways of visualizing residuals. 
When assessing the homogeneity of level-1 residuals across groups, we are, by default, in 
the situation of comparing continuous values across a discrete range. For the example of 
the dialyzer study, we employed side-by-side box plots and side-by-side dotplots to visualize 


show the resulting lineups. The results from the study clearly support that the design 
matters, and identify the side-by-side box plot as the far more powerful design: more than 
one third of all participants (23/61) identihed the data plot from the lineup of box plots, 
while only one out of 71 participants chose the data plot from the lineup of dot plots. In 
particular, more than half of the participants noted the spread as their reason for choosing 
the data plot from the lineup of box plots. 

Very different group sizes make an assessment of homogeneity in the residual variance 
more complicated, because this imbalance introduces structures in the plot that are largely 
artihcial. To overcome this difficulty, we create side-by-side box plots and order groups 


level-1 residuals against the grouping variable (subjects). Figures [l^ 




J.1J. 




14 





1 


2 


3 


4 


5 



Figure 6 : Lineup testing homogeneity of the level-1 residuals with respect to pressure. 
The data plot is in panel 7 ^ 2 ^ -|- 3. Inspecting only the data plot, we see that an increase in 
pressure coincides with an increase in the variability of level-1 residuals. The surrounding 
null plots provide a reference from a properly specihed model and thus allow us to assess 
differences to the data plot. 


according to their interquartile range (IQR). Because of their shape, we have come to call 
these plots cyclone plots. Figure [^shows a lineup of cyclone plots for 66 patients in a longi¬ 
tudinal study investigating the potential of methylprednisolone to treat patients with severe 


alcoholic hepatitis (see Section B.3 for details). The true plot in panel #(2^ -|- 5) is easily 
identihed from the held of null plots (by 49 of 73 observers) revealing heteroscedasticity 
across groups that might not be apparent in other residual plots. 

While the data plot is easily identihed, any panel from this lineup considered separately 
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exhibits a structure that might, taken by itself, lead an analyst to the conclusion that 
the within-group variance varies across the vertical axis. However, placing the true plot 
into the lineup forces the analyst to consider most of this structure as inherent to the 
data structure rather than evidence against the hypothesis of homogeneity. The fact that 
observers are able to identify the data plot indicates that the data plot has additional 
structure inconsistent with homogeneity. The use of lineups incorporates the comparison 
of the data to what is expected under a properly specified model, eliminating the subjective 
interpretations encountered with the use of single plots. 

An alternative approach to detect heteroscedasticity of the level-1 residuals across 
groups is to use a test based on the standardized measure of dispersion given by 

, _ log (g?) - - a) log (g-) / - Tj)] 

{2/{ni - 


where sf is the residual variance within each group based on separate ordinary least squares 
regressions and r* is the rank of the corresponding model matrix (Raudenbush and Bryk| 


2002). The test statistic is then 




(3) 


2=1 

which has an approximate Xg*_i reference distribution when the data are normal and the 
group sizes are “large enough.” Here we use g* because “small” groups may be excluded 
from the calculation as they provide less reliable information about the residual variance 
(assuming that there are enough observations to £t the model), but this is a subjective 
choice. A common rule of thumb is to exclude groups with samples sizes smaller than 10. If 
the distributional assumptions are violated, or we do not have large enough group sizes, the 
approximation to the distribution breaks down. In the methylprednisolone study each 
subject was observed at most five times, with 19 subjects dropping out of the study early. 
Due to the small group sizes the approximation is inappropriate, forcing the analyst to 
rely on simulation to construct the sampling distribution of the test statistic, which is not 
only computationally more demanding than the generation of 19 null plots, but lacks power 
with small sample sizes. Additionally, we have found the iJ-statistic to be sensitive to the 
choice of the minimum group size. Table shows the huge difference in conventional p- 
values to those obtained through a bootstrap test (based on 10,000 bootstrap TT-statistics) 
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Figure 7: Lineup of cyclone plots to assess homogeneity of the level-1 residuals between 
groups. Residual values are plotted in horizontal side-by-side box plots by groups. The 
order of the box plots is determined by the interquartile range of the residuals. Of the 73 
observers, 49 chose the data plot as the most different, indicating a signihcant deviation 
from the assumptions of the residual distribution in a model of the observed data. 

for the radon data set. There is a clear discrepancy between the two p-values, indicating 
a need for simulation-based methods. Additionally, the sensitivity of the simulation-based 
test to the minimum group size is apparent. 

If counties with fewer than 10 observations are excluded, the conventional test yields a 
p-value of 0.149; however, if only counties with fewer than 5 observations are excluded, the 
conventional test indicates strong evidence of heterogeneity based on a p-value of 0.0017. 
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Table 1: Are error terms homoscedastic? Raudenbush and Bryk’s conventional test is 
highly sensitive to the choice of minimal gronp size. The naive p-valnes are obtained using 
the approximation. Bootstrap p-values were calculated from the empirical distribution 
of the test statistic based on 10,000 simulated statistics obtained using the parametric 
bootstrap for each minimum group size. There is a clear discrepancy between the two 
p-values indicating a need for simulation-based methods. 


Minimum group size 

H 

d.f. 

Naive p-value 

Bootstrap p-value 

3 

116.6 

73 

0.0009 

0.9178 

4 

96.8 

62 

0.0031 

0.7980 

5 

77.9 

45 

0.0017 

0.6066 

6 

75.8 

38 

0.0003 

0.5313 

7 

59.0 

33 

0.0036 

0.4697 

8 

51.2 

29 

0.0066 

0.4119 

9 

39.6 

26 

0.0426 

0.3509 

10 

27.7 

21 

0.1490 

0.2595 

11 

26.6 

19 

0.1145 

0.2260 

12 

23.7 

17 

0.1281 

0.1952 

13 

23.7 

16 

0.0966 

0.1873 

14 

8.2 

11 

0.6940 

0.1360 

15 

5.1 

7 

0.6429 

0.0764 


This sensitivity to the group size is a clear weakness of the conventional test, and casts 
doubt on its usefulness in situations with small group sizes. Performing the simulation- 
based version of the test in this example results in a p-value of 0.6066, providing no evidence 
of heterogeneity. 

In contrast, consider Figure another lineup of cyclone plots. The data underlying 


this example are radon measurements across counties in Minnesota (see Section B.5). Here, 
level-1 residuals by county are plotted from a model that only includes counties with at 
least five observations. Only one out of 59 participants identified the data plot shown in 
panel — 6) from the lineup, providing no evidence against homogeneity. The visual 
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test is not overly sensitive to group sizes. Counties with fewer than 5 observations were 
eliminated because box plots are not appropriate for such small group sizes, but could be 
included in the representation as dot plots. While we are still slightly constrained by group 
size, we are far less constrained than with the conventional test, and have a clear way to 
choose the minimum group size based on our ability to render box plots. 



Figure 8: Lineup of 20 cyclone plots of level-1 residuals used to test the assumption of 
homogeneous level-1 residual variance. Only one of 59 observers identihed the data plot as 
the most different, providing no evidence against homogeneity. 
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5.2 Linearity 

Scatterplots with smoothers can also be used to check that the relationship between the 
explanatory variables and response variable is in fact linear. Figure shows such a lineup 
testing the linearity of an observation-level explanatory variable. Out of 63 observers, 60 
identihed the true plot in panel 7 ^( 2 ^ -|- 2), providing evidence that the mean structure is 
misspecihed. This example comes from the dialyzer study and considers a model with only 


linear and quadratic terms for transmembrane pressure (see Section B.4). Based on the 
data panel, it is clear that a higher-order polynomial is required. Once a polynomial of 
degree four is included in the mean structure of the hxed effects, the nonlinear pattern in 
the residuals is removed. A lineup for the quadratic model is shown in Figure In this 
lineup, the data plot is identihed due to heteroscedasticity in the residuals. This highlights 
the hexibility of lineups due to the general phrasing of the alternative hypothesis. By 
tracking observers’ reasons for the choice of plot in a lineup, we can distinguish between 
different alternatives. Using the same framework we will get test results based on where 
we are in the modeling process: as long as the mean structure is not correctly specihed, 
it is most likely the distinguishing feature. Once the mean structure is properly specihed, 
the lineup changes to test for homogeneity of variance. 

To extend checks of linearity to group-level variables we suggest the use of the level-2 
residuals. 


5.3 Distributional assessment 

Recall that in model (|^ we assume that the random ehects, hj, are a random sample 
from a multivariate normal distribution and are independent from the error terms, 
which are assumed to be a random sample from a normal distribution. During the model 
htting process, however, the predicted random ehects are the conditional means of the 
random ehects given the data. While in certain situations the empirical distributions of 
the residuals in the linear mixed-ehects model do converge in probability to their true 
distributions, very strong assumptions that are not realistic in hnite samples are required 
Theorem 3.2 and Lemma 3.1). In practice, the predicted random ehects (i.e., 
the level-2 residuals) will not resemble the unconditional distribution; thus, distributional 


(Jiang, 1998 
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Figure 9: Lineup testing for nonlinearity of a covariate. Level-1 residuals (on y) from a 
model of the dialyzer data are plotted against pressure settings (on x). Pressure is clearly 
identified (by 60 out of 62 observers) as an important higher-order covariate of the model. 


tools such as individual Q-Q plots will lead to erroneous conclusions about the distributional 
assumptions. Lineups help to overcome this complication. 

Figures [TO] and [TT] illustrate the use of lineups to test the distributional assumptions in 
a linear mixed-effects model. The null plots for both of these lineups show Q-Q plots of the 
predicted random slopes after re-£tting model (|^ to bootstrap observations generated using 
the parametric bootstrap as outlined in Section [C^ Consequently, the null plots represent 
Q-Q plots of the random slopes for a properly specified model. In Figure pT| the Q-Q plot of 
the predicted random slopes of model (|^ £t to the radon data was inserted into the lineup, 
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Figure 10: Lineup of Q-Q plots assessing the distribution of the random slope for the 
radon data. The lines and confidence bands show the asymptotic distribution. Several of 
the null plots (and the data plot) exhibit strong deviations from the asymptotic normal 
distribution. Compared to the null plots, the data plot does not stand out: none of the 65 
observers identified the data plot as the most different. 

while the lineup in Figure [IT| included a Q-Q plot of the random slopes in model (Q where 
the random effects were simulated from a multivariate fa-distribution. In both lineups, Q- 
Q plots are drawn with lines representing the asymptotic normal distribution and shaded 
confidence bands. It is obvious that in many of the panels the empirical distribution of 
the predicted random effects—both for the null plots and true plot—does not align with 
the asymptotic distribution. This is particularly pronounced in this example because the 
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radon data consist of groups of very uneven sizes, resulting in a high degree of shrinkage. 
While the conhdence bands show the relationship of the predicted random effects to the 
hypothesized distribution, this is known to be an ill-conceived comparison. However, the 
main comparison in lineups is not a comparison of the predicted random effects to the 
normal distribution, but rather a comparison of the empirical distribution of the random 
effects between the null and observed plots. Consequently, conclusions drawn from the 
lineups relate to evidence of consistency between the true plot and what is expected under 
a properly specihed model. For example, the true plot in panel — 6) in Figure 10 is 


indistinguishable from the null plots (none of 65 observers identihed this plot), providing 
no evidence of a violation of (asymptotic) normality; however, when compared only to 
the normal distribution, the observed Q-Q plot would be rejected by any standard test 
for normality (e.g., the p-value of the Anderson-Darling test is .0004 for the data panel). 
Panel ^(^19 was identihed most often from the lineup in Figure [To| it was picked by 29 out 
of 65 observers. Other panels selected at least four times were ^1, 13, 14, 16, and 18. This 
example also shows the impact of confounding between the different levels of the mixed 


effects model on the distribution of the predicted random effects (Toy and Hofmann, 2015): 
even for null plots, where the random effects were generated from a normal distribution 
before re-htting the model, the predicted random effects do not look normal. In fact, the 
Anderson-Darling test of normality rejects the null hypothesis of normality for 16 of the 
null plots at the 0.05 signihcance level. 

To demonstrate that lineups are indeed able to detect distributional violations, we 
constructed another lineup where the “true plot” was based on predicted random effects 
obtained from a model re-ht to data simulated using normal error terms and ts random 
effects. The null plots display Q-Q plots for the random slopes extracted from this simulated 


model. This lineup is shown in Figure 11 Twenty six out of 63 observers identihed the 
true plot in panel 7 )^(a/ 49 -|- 3 ■ 4), providing evidence that we can visually detect a violation 


of normality of the random effects in Figure IT This indicates that lineups of Q-Q plots 
provide an avenue for distributional assessment where conventional methods fail. Further 
investigation is needed to explore limitations of this approach. The ability of a lineup 
to distinguish a t distribution for the random effects in the radon study shows that the 
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Figure 11: Lineup of Q-Q plots assessing the distribution of random slopes. For the data 
plot, the random effects are simulated from a multivariate distribution; the null plots 
show model re-£ts of data with normally distributed random effects. Of the 63 observers, 
26 identified the data plot as the most different from the lineup, providing evidence that 
we can visually detect a violation of normality of the random effects. 

approach has fewer limitations than conventional approaches, justifying our preference. 

6 Protocol 

Throughout this paper we have illustrated how visual inference can help select and diagnose 
LME models. In all of the examples presented, we have followed the same basic protocol: 
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1. Create lineup data: Assuming that the proposed model (for diagnostics) or reduced 
model (for selection) are correct, we use a parametric bootstrap to simulate new 
responses, re-£t the model to these simulated responses, and extract the residuals 
of interest from the proposed model. For each lineup, this process is used to obtain 


m — 1 = 19 simulated null data sets. (See Section C.2 


2. Render lineups: Draw small multiples of each of the null data sets and randomly 
insert the observed data among the nulls. Each plot is labeled by a number from 1 
to m. These IDs are used for identification and later evaluation of results. 

3. Evaluate lineups: Present the lineups to independent observers, instructing them 
to identify the plot most different from the set and asking them what feature led to 
their choice. These choices came in the form of four suggestions (in checkboxes) and 
one text box for a free-form answer. 

4. Evaluate the strength of evidence: For a lineup of size m = 20 that has been 
evaluated by K independent observers, the number of evaluations of a lineup in which 
the observer identihes the data plot, F, has a Visual distribution VK,m,s =3 as dehned 
by Hofmann et al.| (2015). 

In practice, the modeler will not give every lineup rendered to a panel of independent 
observers. Rather, during the model building process many lineups will be rendered with 
the modeler blinded to the true plot. These lineups replace the traditional exploratory 
and diagnostic plots traditionally used, providing additional “protection” against struc¬ 
ture introduced through the model htting procedure. For critical decisions—perhaps, to 
hnalize model selection or diagnose borderline situations—recruiting a number of indepen¬ 


dent observers is warranted. While we used the MTurk service (Amazon.com, Inc, 2015), 
modelers could also use their colleagues, provided that they are making these evaluations 
independently. 

For each of the lineup designs described in the paper we constructed hve replicates 
consisting of the same data plot and different sets of nineteen null plots for a total of 75 


different lineups. 487 participants were recruited through the MTurk service (Amazon.com, 


Inc, 2015), each participant was asked to evaluate ten lineups, for a total of altogether 4927 


evaluations. In addition to the plot choice and rationale, the time taken to answer was 
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recorded and observers were asked for their confidence level (on a scale from l=weak to 
5=higli). Observers were also asked to provide their demographics: age category, gender, 
education range, and geographic location (from parts of their ip address). It has been 
shown that demographics and educational background of observers, while signihcant, do 


not have a practical impact on detection rates of the true plot from a lineup (Majumder 


et al., 2014). 


The MTurk service provides access to a large pool of participants, whose demographics 
reflect those of internet users (i.e. relative to the average U.S. population, participants are 
younger and more highly educated; the proportion of female participants is typically around 
40%). To be eligible to participate in the study, we required MTurk workers to have at 
least 100 completed HITs (Human Intelligence Tasks) with a success rate of at least 90%. 
Assuming that these workers have a vested interest in their reputation, we hope that such a 
selection reduces the rate of participants gaming the system. The pool of MTurk workers is 
large enough to allow for very fast collection of lineup evaluations; all of the data presented 
here were collected within a few hours. 

While we do not require any statistical training from participants, training in math¬ 
ematical reasoning has been found to signihcantly help with performance on lineup tests 
(see 


Vander Plas and Hofmann, 2016). 


7 Discussion 

We have presented a graphical approach to model selection and diagnosis using lineups 
constructed by simulation from the model. Lineup tests provide us with the framework to 
test hypotheses and also allow a subsequent exploration of the plots in the lineup for addi¬ 
tional insight into the data structure. Thus, instead of simply rejecting the null hypothesis, 
the use of lineups also allows us to explore why we are rejecting the null hypothesis. This 
approach relies on the simulation process, the design of the graphics created, and observers, 
but avoids the reliance on asymptotic reference distributions; thereby circumventing the 
pitfalls of many commonly used tests. While the value of lineup tests is, perhaps, most 
obvious in such problematic situations, the approach provides an avenue for testing in all 
situations, even those in which commonly used tests are appropriate. 
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The graphical approach is relatively new and involves working with human observers 
recruited on the web. There is a vast experience in statistics in engaging subjects in a 
traditional lab setting where researchers are closely engaged with the experiment. With 
MTurk, researchers are working with subjects in a long distance relationship, and there are 
some participants who will try to “game the system.” Nevertheless, results are promising, as 


MTurk experiments have replicated studies conducted in the traditional lab setting (Crump 


et ah, 

2013 

). Specihcally, 

Heer and Bostock 

(2010 

) achieved matching results to those of 

Cleveland and McGill 

(1984 

) using MTurk, and 

Kosara and Ziemkiewicz 

(2010 

) found 


similar results between a lab study and one performed using MTurk. 

As with conducting surveys, avoiding leading questions is very important. For this 
study, observers were generically asked to pick the plot with the most different features. 
Contextual information, such as labels, axis tick marks, and titles, were removed to avoid 
subjective bias. Care was taken in constructing lineup sequences and different plot designs 
so that observers saw the data only once. Because of the hnite nature of possible comparison 
in a lineup, multiple replications of lineups (e.g., hve, as in this study) are recommended 
using different sets of m — 1 null plots. 

By asking observers for the distinguishing feature(s) of the plot they chose, graphical 
tests also provide us with information about the specihc violation(s) of the null that is (are) 
captured in a general alternative hypothesis. A hxed selection of reasons were provided for 
observers to choose from, but they could also enter a different reason in a text box. This 
allowed us to examine responses by reasons for selection to assess individual sub-hypotheses 
and investigate different types of errors. For example, in Figure some participants chose 
plots because they showed the “most straight line” or “closest £t” to the line. These 
reasons would indicate that the person may not have had prior experience in reading Q-Q 
plots, and mistakenly looked for compliance with a theoretical distribution. This can be 


interpreted as a Type II error. Type III errors (Mosteller, 1948), where the null hypothesis 
is rejected for the wrong reason, can also be detected by investigating participants’ choices. 

Plot design is important—some designs are better at revealing anomalies than others. 
At a population level, the results obtained from the same plot design are very stable. In 
this paper, design choices were made using our best judgment based on experience and 
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results from cognitive psychology. As more studies are conducted more about the power of 
designs for model diagnostics will be learned, enabling more informed decisions about best 


practices (Loy et ah, 2015 Hofmann et al., 2012). A further beneht of using the graphical 


framework for testing is that graphics adapt relatively well to big data situations (Unwin 


et al., 2006). This provides us with a viable approach to assess the practical relevance of a 


result versus results that show statistical signihcance purely based on the dimension of the 
problem. Barring bad design choices, graphical tests allow us to judge practical relevance 
of a result as “If we do not see it, it might be there but not be relevant.” 

While we have tried to present a variety of different plots throughout this paper, one of 
the strengths of the graphical approach is that a single visualization often provides insight 
into different aspects of the model. For example, a plot of a continuous covariate and the 
residuals of a corresponding model enables us to investigate the presence of a non-linear 
trend, the amount of heteroscedascity (with respect to that covariate), and, to a degree, 
reveals features of the marginal distribution of the residuals, such as its skewness. The 
lineup protocol allows us to simultaneously assess several model assumptions. Very few 


conventional tests allow for this. The paper by Pena and Slate (2006) is a rare exception: 
here, the authors present a test for the global validation of assumptions in a linear model. 
Alas, the test does not allow for a ready extension to the hierarchical model. 

The diagnostics in this paper draw from various sources. Some of the diagnostics 
presented here are well-known diagnostic tools, such as Q-Q plots or scatterplots of residuals 


with trend lines as suggested by Cook and Weisberg (1999) for ordinary least squares 
regression. Some diagnostics are suggestions from the literature specihc to mixed effects 
models. An example of a new diagnostic addressing a practical need are the cyclone plots 
of Figure The overarching purpose of these examples is to show visual diagnostics in a 
wide variety of situations that all need special consideration in the conventional hypothesis 
testing setting, but that all £t within the same graphical inference framework. Many 
situations, such as outlier detection, were not discussed in this paper, but the results also 
extend to these problems. The reader is encouraged to examine more examples provided 


by Buja et al. (2009) and Majumder et al. (2013), as well as Roy Chowdhury et al. (2014) 


for biological applications. 
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Model Choice and Diagnostics for Linear 
Mixed-Effects Models Using Statistics on 
Street Corners (Supplementary Materials) 


The materials in this document supplement the information presented in the manuscript 
“Model Choice and Diagnostics for Linear Mixed-Effects Models Using Statistics on Street 
Corners.” Section gives an overview of the theoretical framework of linear mixed effects 
models. Section [B] describes the data sets used to illustrate the use of visual inference in the 
paper. Section 0 details the experimental setup, generation of null plots, the calculation 
of p-values, and additional lineups that were included in the study, but omitted from the 
paper for brevity. The hgure numbering carries on from the paper for ease of reporting 
results. 


A Model overview 

Linear mixed-effects models can account for dependence structures when data are composed 
of groups. Such structures occur, for example, when individuals are naturally grouped by 
organization (e.g., students within schools), geography (e.g., voters within states), or design 
(e.g., respondents assigned to interviewers). The models allow data to be incorporated 
at both the observation-level (level 1) and the group-level (level 2, or higher) while also 
accommodating dependencies between individuals within the same group. 

For data organized in g groups, consider a continuous response linear mixed-effects 
model (LME model) for each group i, i = 1,..., g: 

Vi = Xi (3 + Zi bi + Si (4) 

(niXl) (riiXp) (pxl) (riiXq) (gxl) (n^xl) 

where yi is the vector of outcomes for the rij level-1 units in group i, Xi and are 
design matrices for the hxed and random effects, respectively, /3 is a vector of p hxed 
effects governing the global mean structure, h* is a vector of q random effects describing 
the between-group covariance structure, and Si is a vector of level-1 error terms accounting 
for the within-group covariance structure. The random effects, hj, are assumed to be a 
random sample from A/'(0, D) and independent from the level-1 error terms, £*, which are 
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assumed to follow a Af{0, a'^Ri) distribution. Here, Z) is a positive-definite qxq covariance 
matrix and Ri is a positive-definite Ui x rii covariance matrix. Finally, it is assumed that 
all between group effects have a covariance of zero. 

Inference typically centers around either the marginal or conditional distribution of yi, 
depending on whether global or group-specihc questions are of interest. Based on model 
(|^ the marginal distribution of yi for alH = 1,..., 5^ is given by 

y,^f^{X,f3, Vi), ( 5 ) 

where Vi = ZiDZ[ -|- a^Ri, and the conditional distribution of yi given bi is dehned as 

yi\bi Af [Xif3 + Zibi, a^Ri) . ( 6 ) 

Similar to simple linear models, residuals form the diagnostic core of a LME model. But, 
LME model residual analysis is complicated by the fact that there are numerous quantities 
that can be dehned as residuals, with each residual quantity being associated with different 
aspects of the model. The two fundamental residuals for model checking considered here 
are: 

• the level-1 (observation-level) residuals, the conditional residuals or error terms: 

— yi Xil3 Zibi, 

• and the level-2 (group-level) residuals, the predicted random effects bi 
where (3 is an estimate of the hxed effects, 

13= {Y.X[V-^x)\ ( 7 ) 

\i=l / i=l 

and bi are predictions of the random effects, given as 

b, = DZ[Vr^ {y, - X^) , V ^ = 1,..., ^ 7 . ( 8 ) 

When Vi is unknown, estimates for the covariance matrices are used in the above equations. 
These estimates are commonly found through maximum likelihood (ML) or restricted max¬ 
imum likelihood (REML). 
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B Data sets 


All of the data sets used in this paper are publicly available: the General Certificate of 
Secondary Education Exam data set is available in the R package mlmRev ([Bates et al.| 


2011); the Dialyzer data set is available in the R package MEMSS (Bates et ah, 2012); all 


other data sets can be found in the R package HLMdiag (Loy, 2013). 


B.l General certificate of secondary education exam data 

We make use of a subset of examination results of 4,065 students nested within 65 inner- 


London schools discussed by Goldstein et ah (1993). The original analysis explored school 


effectiveness as dehned by students’ performance on the General Certihcate of Secondary 
Education Exam (GCSEE) in both mathematics and English. This exam is taken at 
the end of compulsory education, typically when students are 16 years old. To adjust 
for a student’s ability when they began secondary education, the students’ scores on the 
standardized London Reading Test (LRT) and verbal reasoning group (bottom 25%, middle 
50%, or top 25%) at age 11 were recorded. Additional information contained in the data set 
includes student gender, school gender, and the average LRT intake score for each school. 


B.2 Autism study 

In an effort to better understand changes in verbal and social abilities from childhood 


to adolescence, Anderson et ah (2007, 2009) carried out a prospective longitudinal study 


following 214 children between the ages of 2 and 13 who had been diagnosed with either 
autism spectrum disorder or non-spectrum developmental delays at age 2. The Vineland 
Adaptive Behavior Interview survey was used to assess each child’s interpersonal relation¬ 
ships, play time activities, and coping skills, from which the Vineland Socialization Age 
Equivalent (VSAE) was computed as an overall measure of a child’s social skills. Addition¬ 
ally, expressive language development at age 2 was assessed using the Sequenced Inventory 
of Gommunication Development (SIGD) and the children were classified into three groups 
(high, medium, or low). Assessments were made on the children at ages 2, 3, 5, 9, and 
13, however, not all children were assessed at each age. Additional information collected 
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on each child includes: gender, race (white or non-white), and initial diagnosis at age 2 
(autism, pervasive development disorder (pdd), or non-spectrum). We restricted attention 
to models concerned with the changes in social skills for subjects diagnosed with autism 
spectrum disorder having complete data. This results in a reduced data set of 155 children. 


For more detailed analyses we refer the reader to Anderson et ah (2007, 2009). 


B.3 Methylprednisolone study 


Carithers et ah (1989) conducted a four week longitudinal study to investigate the effec¬ 


tiveness of methylprednisolone to treat patients with severe alcoholic hepatitis. The re¬ 
searchers randomly assigned 66 patients to receive either methylprednisolone (35 patients) 
or a placebo (31 patients). Over the study duration, each subject’s serum bilirubin levels 
(in /imol/L) were measured each week, with the hrst measurement taken at the start of 
the study (week 0). 


B.4 Dialyzer study 


Vonesh and Carter (1992) describe a study characterizing the water transportation char¬ 


acteristics of 20 high flux membrane dialyzers, which were introduced to reduce the time 
a patient spends on hemodialysis. The 20 dialyzers were studied in vitro using bovine 
blood at flow rates of either 200 or 300 ml/min, and the ultrahltration rate (ml/hr) for 


each dialyzer was measured at seven transmembrane pressures (in mmHg). Vonesh and 


Carter (1992) use nonlinear mixed-effects models to analyze these data; however, they can 


be modeled using polynomials in the linear mixed-effects framework (see Littell et ah, 2006 
Section 9.5). 


B.5 Radon study 

The data consist of a stratihed random sample of 919 owner-occupied homes in 85 counties 
in Minnesota. For each home, a radon measurement was recorded (in log pCi/L, i.e., log 
picoCuries per liter) as well as a binary variable indicating whether the measurement was 
taken in the basement (0) or a higher level (1). Additionally, the average soil uranium 
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content for each county was available. The number of homes within each county varies 
greatly between counties ranging from one home to 116 homes, with 50% of counties having 


measurements from between 3 and 10 homes. Gelman and Pardoe (2006) suggest a simple 


hierarchical model allowing for a random intercept for each county and a random slope for 
floor level. This is the model from which we simulate predicted random effects. 


C Experimental Setup and Results 

C.l Experimental Setup and Calculation of values 

For each of the lineup designs described in the paper we constructed hve replicates consist¬ 
ing of the same data plot and different sets of nineteen null plots for a total of 75 different 
lineups. These were evaluated by 487 participants in altogether 4927 evaluations. For each 
lineup, observers were instructed to identify the plot most different from the set and asked 
what feature led them to their choice. These choices came in the form of four suggestions 
(in checkboxes) and one text box for a free-form answer. For each lineup the time taken 
to answer was recorded and observers were asked for their conhdence level (on a scale 
from l=weak to 5=high). Observers were also asked to provide their demographics: age 
category, gender, education range, and geographic location (from parts of the ip address). 

The results of the evaluations for all lineups are displayed in Table and the observers’ 
reasons for identifying plots are summarized in Table The signihcances in Table 
are based on the number of evaluations and the number of times that the data plot was 
identihed. For that, we the introduce the random variable Y as the number of evaluations 
of a lineup in which the observer identihes the data plot. Assume that the lineup has size 
m = 20, and it is shown to a total of K independent observers. Then Y has a Visual 


distribution VK,m,s =3 as dehned in Hofmann et ah (2015), where s delineates scenario III 


i.e., the same lineup is shown to all K observers. The p-values in Table for each of the 
hve replicates are calculated this way. For the overall p-value, we use a simulation based 
approach to combine the hve results. Treating the number of evaluations (iFi, as 

hxed, we simulate assessments of lineups without signal as follows: we assume that the 
signal in a plot is complementary to its p-value, which is i.i.d. U[0, 1] under the null 
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hypothesis. We further assume that the probability an observer picks a plot is allocated 
proportionally to its signal. For each “data plot” we create five sets of null plots to be 
evaluated simultaneously {Ki,K^) times, p-values are then based on a comparison of 
the sum of data picks from five no-signal lineups and the observed number of data picks 
from the actual lineups. The column on the right of Table shows p-values based on 10^ 
simulation runs. 
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Table 2: Overview of all lineup evaluations. Ratios comparing the number correct to 
the total number of evaluations are shown, p-values and signihcances are based on the 




Replicate 



Overall 

1 

2 

3 

4 

5 

p-values 

10/68 * 

7/65 

8/61 . 

13/61 *** 

6/66 

0.0022 

0/64 

7/75 

11/76 * 

0/69 

0/60 

0.4202 

60/68 *** 

51/59 *** 

59/64 *** 

51/60 *** 

62/71 *** 

< 10-^ 

36/63 *** 

58/69 *** 

30/60 *** 

51/68 *** 

52/63 *** 

< 10-^ 

26/80 *** 

9/59 * 

23/60 *** 

7/55 . 

11/69 * 

< 10-^ 

23/70 *** 

9/74 . 

11/62 ** 

31/78 *** 

25/61 *** 

< 10-^ 

49/73 *** 

41/59 *** 

41/59 *** 

40/66 *** 

49/65 *** 

< 10-^ 

1/59 

2/79 

2/68 

4/62 

1/72 

0.6567 

52/55 *** 

60/62 *** 

49/52 *** 

79/83 *** 

63/67 *** 

< 10-^ 

26/63 *** 

48/75 *** 

0/59 

11/56 ** 

6/69 

< 10-^ 

0/72 

1/75 

0/68 

0/75 

2/61 

0.8904 

0/65 

0/64 

0/68 

0/46 

0/64 

1.0000 

48/76 *** 

26/55 *** 

28/63 *** 

30/70 *** 

30/60 *** 

< 10-^ 

0/57 

1/71 

8/75 

0/59 

2/68 

0.6448 

23/61 *** 

38/72 *** 

29/59 *** 

22/70 *** 

35/62 *** 

< 10-^ 


Lineup 


Hiijii 


fig- 

fig- 

fig- 

fig- 

fig- 

fig 


12 


liiij. 


fig- 

fig- 

fig- 

fig 


11 



fig- 

13 


fig- 

14 


fig- 

15 


Signif. codes: 0 < 




< 0.001 < ** < 0.01 < * < 0.05 < . < 0.1 < ’ ’ < 1 
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Table 3: Percent of data picks, given the reason for the choice of plot from the lineup. 



Lineup 


Outlier 

Spread 

Trend 

Asymmetry 

Other 


fig. 2 

12.8 

24.0 

10.9 

5.8 

15.9 


fig. 4 

2.4 

10.6 

3.5 

5.4 

0.0 


fig. 3 

73.3 

95.5 

92.4 

91.0 

80.2 


fig. 5 

64.5 

34.3 

83.4 

70.7 

64.9 


fig. 6 

19.0 

27.3 

29.6 

25.8 

19.0 


fig. 

12 

26.7 

24.6 

30.1 

43.7 

0.0 

w 








I 

fig. 7 

49.6 

49.5 

77.7 

79.4 

91.8 

% 

fig. 8 

4.2 

0.6 

2.4 

6.8 

0.0 


fig. 9 

84.0 

87.4 

98.3 

98.1 

100.0 


fig. 11 

37.8 

49.2 

19.6 

4.4 

10.9 

random intercept 

1.5 

0.0 

1.1 

0.0 

0.0 

random slo 

pe 

0.0 

0.0 

0.0 

0.0 

0.0 

A 

fig. 

13 

39.2 

33.8 

57.9 

78.5 

70.0 

jl'!g g ‘£.;^ 

fig. 

14 

2.4 

3.0 

7.6 

4.8 

0.0 


fig. 

15 

56.8 

55.1 

20.1 

33.9 

22.7 
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C.2 Generating null plots 

All of the lineups presented in this paper use a parametric bootstrap to generate plots 
consistent with the null hypothesis. This section outlines the parametric bootstrap for 
LME models and provides more detail about how the null plots were generated. 

For a htted continuous response LME model (as outlined by Equations (|^-([^ and the 
intermediate discussion) the parametric bootstrap proceeds as follows: 

1. Generate a vector of q random effects (i.e., level-2 residuals) from A/'(0, D) for each 
group; that is, generate b* A/'(0, . 5 ) fori = 1 ,...,^. 

2. Generate a vector of residuals of length Ui (i.e., level-1 residuals) from A/'(0, a‘^Ri) 

for each group; that is, generate £* A^(0, a‘^Ri) fori = 1 ,...,^. 

3. Generate a bootstrap sample y* for each group i = 1,..., g from y* = Xi(3+Zib*+e*. 

4. Reht the model to the bootstrap samples. 

5. Repeat steps 1-4 B times. 

The parametric bootstrap was used to generate the null plots in each situation. During 
model selection the simpler models are used to generate the null plots using the parametric 
bootstrap, while model checking bootstraps the original model to generate null plots. 

Selecting fixed effects. To use lineup tests to determine whether a variable should be 
included in the htted model we must generate null plots from a model excluding the variable 
in question. Let denote the design matrix for the hxed effects with the cth column 
deleted. We use the model 

Vi — + Zibi -\- Si 

to generate the null plots. 

Selecting random effects structure. To use a lineup test to determine whether a 
random effect should be included in the htted model we must generate null plots from a 
model excluding the random ehect in question. This results in a model of the form of (|^ 
with the column of Zi corresponding to the random ehect in question deleted, and a random 
ehects vector bi of length q — 1. To determine whether it is necessary to allow the random 
ehects to be correlated, the null plots are generated using a model where the covariance 
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matrix of the random effects D has zero entries in the appropriate off diagonal entries. 

Model checking. Generating the nnll plots used for model checking is a direct applica¬ 
tion of the parametric bootstrap as detailed above to obtain B htted models from which 
the appropriate aspects are extracted. 


C.3 Additional lineups included in the study 


This section includes two lineups that were included in the MTurk study, but were not 
discussed in the paper. 

Figure [T^ contains a box plot representation of the same data as Figure 5 in the paper, 
but categorizes pressure into seven categories, and shows residuals in the form of box plots. 
In order to preserve the appearance of continuity on the x-axis we used a color scheme to 
hll the boxes with deepening shades of blue from left to right. In this form 23 out of 70 
observers identify the plot of the data. This is consistent with the other design. 

Figure [T^ displays another lineup testing the adequacy of the random effects specihca- 
tion (see Section 3.2 of the paper) using data from the autism study. The null plots were 
generated from a model containing only a linear random slope, so if the true plot in panel 
+ 12) is identihed it provides support for the inadequacy of this specihcation, and 
the need for additional random effects. 


Figures and show another example of testing for homogeneity in the variance 
following the approach taken in Section 4.1 of the paper. Both of these lineups are based 
on the dialyzer data. Level-1 residuals are plotted by subject. Subjects are ordered by 
variance—i.e., we get some structure that might be taken for differences in variability, that 
are really just differences due to the imbalance in group size. If any panel of this lineup 
is considered separately, an analyst may come to the conclusion that the within-group 
variance increases across the x axis. However, inserting the true plot into the lineup forces 
the analyst to consider this particular feature as inherent to the data structure rather 
than evidence against a hypothesis of homogenous variance. The dot plot version is not 
signihcant, but the box plot version is. When participants in the box plot design identify 
the data plot, about 45.3% give outliers as the reason for their choice. In contrast to that. 
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Figure 12: Alternative box plot Lineup testing homogeneity of the level-1 residuals. Which 
of the plots is the most different? Which feature led you to your choice? 

outliers, a large spread or a trend are in a three-way tie for the reason for identifying the 
data plot in the lineup with the dot plot design. 

The reason for the box plot design being so much more significant might not be so much 
of an issue of homogeneity being violated as much as a difference in the error distribution 
between the data and the nulls. Null data come from a parametric bootstrap where residuals 
are simulated under a normal error assumption. Outliers in small samples are indicative 
of the sample being from a distribution with heavy tails. Regardless of the reasoning, the 
second lineup design enables us to diagnose a problem with the model that makes the data 
stand out from a set of nulls. The two designs therefore represent two very similar tests 
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Figure 13: Which of the plots is the most different? Which feature led you to your choice? 
with different power. 
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Figure 14: Lineup testing homogeneity of the level-1 residuals. Which of the plots is the 
most different? Which feature led you to your choice? 
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